Round-Efficient Secure Inference Based on Masked Secret Sharing for Quantized Neural Network

Existing secure multiparty computation protocols based on secret sharing usually assume a fast network, which limits their practicality on low-bandwidth, high-latency networks. A proven remedy is to reduce the number of communication rounds as much as possible, or to construct constant-round protocols. In this work, we provide a series of constant-round secure protocols for quantized neural network (QNN) inference, built on masked secret sharing (MSS) in the three-party honest-majority setting. Our experiments show that the protocols are practical and suitable for low-bandwidth, high-latency networks. To the best of our knowledge, this is the first work to implement QNN inference based on masked secret sharing.


Introduction
As an essential application of machine learning as a service (MLaaS) [1], neural network inference is widely used in image recognition [2,3], medical diagnosis [4], and so on. In the traditional MLaaS paradigm, the model owner provides a trained neural network, and a user, who holds some queries, calls an API of MLaaS to enjoy the inference service. However, with growing privacy awareness and the continued improvement of laws and regulations [5], the traditional MLaaS paradigm is being challenged. On the one hand, the user is unwilling to reveal queries and inference results to the model owner. On the other hand, the trained model is intellectual property belonging to the model owner and cannot be revealed to the user. Secure inference uses cryptographic techniques to ensure that neither party's sensitive information is revealed to the other.
In general, different cryptographic tools involve different trade-offs. Fully homomorphic encryption (FHE) is communication-efficient but computationally expensive, which makes it impractical [6]. Secret sharing (SS), an important component of secure multiparty computation (MPC), is computation-efficient but requires more communication rounds [7,8]. Existing works based on secret sharing usually assume a fast network, that is, one with high bandwidth and low latency, such as a local area network (LAN). However, all these works are inefficient in low-bandwidth and high-latency networks, even under the semi-honest model. A fast network is difficult to guarantee in the real world, especially in the wide area network (WAN) setting. A proven remedy is to reduce the communication rounds of the protocol as much as possible or to construct protocols with constant rounds. These methods are also important for computationally intensive neural network inference.
Recently, QNN has gained much attention. The quantization technique reduces the overall computational overhead of the model by limiting the representation bit-width of its data and parameters, at the expense of a certain level of model accuracy. More precisely, quantization converts the floating-point arithmetic (FP32, 32-bit floating point, single
• We provide a series of secure protocols with constant-round communication complexity for QNN inference, including secure truncation, conversion, and clamping protocols. We achieve this by constructing the protocols on MSS.
• We give detailed proofs of security in the semi-honest model. Concretely, our protocols are secure against a single corruption.
• The experiments show that our protocols are practical and suitable for high-latency networks. Compared to the previous work on quantized inference, our protocols are 1.5 times faster in the WAN setting.
The remainder of this work is organized as follows. In Section 2, we define notations and primitives related to cryptographic tools, security model, neural networks, and quantization. In Section 3, we show the architecture for QNN secure inference. In Section 4, we give several building blocks of QNN inference and provide security analysis of our protocols. In Section 5, we provide our QNN structure. In Section 6, we implement our protocols and then report the experimental results. Finally, we conclude this work in Section 7.

Basic Notations
We first define the notation used in this work in Table 1.

| Notation | Description |
| $\kappa$ | The computational security parameter |
| $P_j$ | The computing party, where $j \in \{0, 1, 2\}$ |
| $A$ | The tensor or matrix |
| $a$ | The vector |
| $\ell$ | The logarithm of the ring size |
| $\mathbb{Z}_{2^\ell}$, $\mathbb{Z}_2$ | The integer ring and the boolean ring |

Threat Model and Security
In this work, we consider three non-colluding servers as the computing parties of MPC to execute secure inference tasks, where a static, semi-honest adversary A corrupts only a single party during the protocol execution. The semi-honest adversary corrupts one of the three parties and obtains its view (including its input, random tape, and the messages received during the protocol execution), but follows the protocol specification exactly.
Our protocols rely on a secure pseudo-random function (PRF); we can therefore only provide security against a computationally bounded adversary, and all our protocols are computationally secure. Formally, we define semi-honest security as follows:
Definition 1 (Semi-honest Security [18]). Let $\Pi$ be a three-party protocol in the real world and $F : (\{0,1\}^*)^3 \to (\{0,1\}^*)^3$ be the ideal functionality in the ideal world. We say $\Pi$ securely computes $F$ in the presence of a single semi-honest adversary if for every corrupted party $P_i$ ($i \in \{0, 1, 2\}$) and every input $x \in (\{0,1\}^*)^3$, there exists an efficient simulator Sim such that
$$\{\mathrm{Sim}(x_i, F_i(x))\} \overset{c}{\approx} \{\mathrm{View}^{\Pi}_{i}(x)\},$$
where $\mathrm{View}^{\Pi}_{i}(x)$ denotes the view of $P_i$ in a real execution of $\Pi$ on input $x$, and $x_i$ and $F_i(x)$ denote the input and output of $P_i$.
In other words, a protocol $\Pi$ is computationally secure in the semi-honest model if and only if the view of the ideal-world simulator and the view of the real-world adversary are computationally indistinguishable.

Secret Sharing Semantics
Let $x$ be the secret. Similar to [19], we use the following sharing in this work.
• $[\cdot]$-sharing: ASS among $P_1$ and $P_2$. The dealer samples random elements $x_1, x_2 \in_R \mathbb{Z}_{2^\ell}$ as the shares of $x$, such that $x = x_1 + x_2 \bmod 2^\ell$ holds. The dealer distributes the shares to each party such that $P_i$ for $i \in \{1, 2\}$ holds $x_i$. For simplicity, we denote $[x]_i$ as the additive share of $P_i$, and $[x] := (x_1, x_2)$.
• $\llbracket\cdot\rrbracket$-sharing: MSS among all parties. The dealer samples a random element $\lambda_x \in_R \mathbb{Z}_{2^\ell}$, computes $m_x = x + \lambda_x \bmod 2^\ell$, and then shares $\lambda_x = [\lambda_x]_1 + [\lambda_x]_2$ among $P_1$ and $P_2$ by $[\cdot]$-sharing. The dealer distributes the shares to each party, such that $P_0$ holds $([\lambda_x]_1, [\lambda_x]_2)$, $P_1$ holds $(m_x, [\lambda_x]_1)$, and $P_2$ holds $(m_x, [\lambda_x]_2)$. For simplicity, we denote $\llbracket x \rrbracket_i$ as the masked shares of $P_i$, and $\llbracket x \rrbracket := (m_x, [\lambda_x]_1, [\lambda_x]_2)$.
Table 2 summarizes the individual shares of the parties for the aforementioned secret sharing. It is easy to see that each party only misses one share to reconstruct the secret $x$.

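Table 2. The individual shares of the parties for each sharing scheme.

| Scheme | $P_0$ | $P_1$ | $P_2$ |
| $[\cdot]$-sharing | none | $x_1$ | $x_2$ |
| $\llbracket\cdot\rrbracket$-sharing | $([\lambda_x]_1, [\lambda_x]_2)$ | $(m_x, [\lambda_x]_1)$ | $(m_x, [\lambda_x]_2)$ |

The above steps can also be extended to $\mathbb{Z}_2$ by replacing addition/subtraction with XOR and multiplication with AND. We use both $\mathbb{Z}_{2^\ell}$ and $\mathbb{Z}_2$ as the computation domains and refer to the shares as arithmetic sharing and Boolean sharing, respectively. We denote Boolean sharing with B in the superscript, so that the Boolean sharing of a bit $b$ is $[b]^B$ or $\llbracket b \rrbracket^B$, depending on the type of sharing semantics.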
Note that both $[\cdot]$-sharing and $\llbracket\cdot\rrbracket$-sharing satisfy the linearity property, which allows the parties to compute a linear combination of two shared values non-interactively. We only introduce the basic operations of MSS in this section. To reduce communication costs, $F_{Rand}$ is used (cf. Appendix A).
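The sharing semantics and the linearity property can be illustrated with a minimal Python sketch. It plays the role of the dealer and keeps all three components in one tuple, so it is a single-process illustration rather than a distributed implementation; the function names and the use of the `secrets` module are assumptions made for this example.

```python
import secrets

ELL = 64                 # ring exponent (the experiments use a 64-bit share length)
MOD = 1 << ELL           # all arithmetic is over Z_{2^ELL}

def additive_share(x):
    """[.]-sharing: split x into x1 + x2 = x (mod 2^ELL)."""
    x1 = secrets.randbelow(MOD)
    x2 = (x - x1) % MOD
    return x1, x2

def masked_share(x):
    """[[.]]-sharing: return (m_x, lam1, lam2) with m_x = x + lam1 + lam2."""
    lam1, lam2 = secrets.randbelow(MOD), secrets.randbelow(MOD)
    m_x = (x + lam1 + lam2) % MOD
    return m_x, lam1, lam2

def reconstruct(share):
    """Recover x from the full triple (no single party holds all of it)."""
    m_x, lam1, lam2 = share
    return (m_x - lam1 - lam2) % MOD

def linear(c, sx, d, sy, e):
    """Shares of z = c*x + d*y + e, computed component-wise without interaction."""
    m_z = (c * sx[0] + d * sy[0] + e) % MOD    # the public constant e is added to m only
    lam1 = (c * sx[1] + d * sy[1]) % MOD
    lam2 = (c * sx[2] + d * sy[2]) % MOD
    return m_z, lam1, lam2
```

In a deployment, $P_0$ would hold only the two mask shares and $P_1$, $P_2$ would each hold the masked value and one mask share; the tuple above merely collects them for checking correctness.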

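• For a linear combination $z = cx \pm dy \pm e$, the parties locally compute its shares to be $\llbracket z \rrbracket = (c\,m_x \pm d\,m_y \pm e,\ c[\lambda_x]_1 \pm d[\lambda_y]_1,\ c[\lambda_x]_2 \pm d[\lambda_y]_2)$.
• For multiplication $z = xy$, which we denote as functionality $F_{Mul}$, $\Pi_{Mul}$ can be achieved as follows [19]:
1. $P_0$ and $P_1$ locally sample random $[\lambda_z]_1$ and $[\gamma_{xy}]_1$ by using $F_{Rand}$;
2. $P_0$ and $P_2$ locally sample random $[\lambda_z]_2$ by using $F_{Rand}$;
3. $P_0$ locally computes $\gamma_{xy} = \lambda_x \lambda_y$ and sends $[\gamma_{xy}]_2 = \gamma_{xy} - [\gamma_{xy}]_1$ to $P_2$;
4. $P_i$ for $i \in \{1, 2\}$ locally computes $[m_z]_i = (i-1)\,m_x m_y - m_x[\lambda_y]_i - m_y[\lambda_x]_i + [\lambda_z]_i + [\gamma_{xy}]_i$;
5. $P_i$ for $i \in \{1, 2\}$ sends $[m_z]_i$ to $P_{3-i}$, who locally computes $m_z = [m_z]_1 + [m_z]_2$.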
It is easy to see that the multiplication requires communication of at most $3\ell$ bits and 2 rounds. Note that steps 1-3 are independent of the secrets $x$ and $y$, so the protocol can be improved by using the offline-online paradigm (see Section 3). In this way, the multiplication only requires $2\ell$ bits and 1 round in the online phase.
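The correctness of masked multiplication follows from $xy = (m_x - \lambda_x)(m_y - \lambda_y)$. The sketch below simulates all three parties in one process to check this identity. The assignment of the $m_x m_y$ term to $P_2$ and all helper names are assumptions made for this illustration; the comments mark which party would perform each step.

```python
import secrets

ELL = 64
MOD = 1 << ELL

def rand():
    return secrets.randbelow(MOD)

def share(x):
    lam1, lam2 = rand(), rand()
    return ((x + lam1 + lam2) % MOD, lam1, lam2)

def reconstruct(s):
    return (s[0] - s[1] - s[2]) % MOD

def mul(sx, sy):
    m_x, lx1, lx2 = sx
    m_y, ly1, ly2 = sy
    # Offline, input-independent: fresh output mask and shares of gamma = lam_x * lam_y.
    lz1, lz2 = rand(), rand()                     # sampled pairwise with P0
    gamma = ((lx1 + lx2) * (ly1 + ly2)) % MOD     # P0 knows both masks in full
    g1 = rand()                                   # common to P0 and P1
    g2 = (gamma - g1) % MOD                       # sent by P0 to P2
    # Online: P1 and P2 each compute one additive part of m_z, then exchange them.
    mz1 = (-m_x * ly1 - m_y * lx1 + g1 + lz1) % MOD
    mz2 = (m_x * m_y - m_x * ly2 - m_y * lx2 + g2 + lz2) % MOD
    m_z = (mz1 + mz2) % MOD
    return (m_z, lz1, lz2)
```

Summing the two online parts gives $m_x m_y - m_x\lambda_y - m_y\lambda_x + \lambda_x\lambda_y + \lambda_z = xy + \lambda_z$, which is exactly a masked value of the product.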
The aforementioned scalar operation can be extended to tensor A or vector α by sharing the elements of A or α element-wise. We omit the detail here.

Neural Network
A neural network usually includes many linear and non-linear layers, all stacked on top of each other such that the output of the previous layer is the input of the next layer. We summarize the linear layers and non-linear layers as follows.
The linear layers usually include the fully connected layer and the convolution layer. Both can be computed by matrix multiplications and additions:
• The fully connected layer can be formulated as $y = Wx + b$, where $y$ is the output of the fully connected layer, $x$ is the input vector, $W$ is the weight matrix, and $b$ is the bias vector.
• The convolution layer can be converted into computing the dot product of a matrix and a vector, followed by one addition, as shown in [20]; thus, it can be formulated in the same way as $y = Wx + b$.
The non-linear layers introduce nonlinearity into neural networks and bound inputs to a fixed range, for example, by evaluating an activation function. In this work, we only consider the rectified linear unit (ReLU) activation, which is defined as ReLU(x) = max(x, 0).
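As a plaintext reference point for the layers above, the following sketch evaluates a fully connected layer followed by ReLU on small integer inputs. The concrete weights and inputs are made-up values for illustration only.

```python
def fully_connected(W, x, b):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(v):
    """Element-wise ReLU(x) = max(x, 0)."""
    return [max(t, 0) for t in v]

W = [[1, -2], [3, 4]]    # hypothetical 2x2 weight matrix
x = [5, 6]               # hypothetical input vector
b = [1, -30]             # hypothetical bias vector

y = fully_connected(W, x, b)
activated = relu(y)
```

In the secure setting the same matrix-vector product is evaluated on shares, which is why only multiplication, addition, and a comparison-based activation are needed as building blocks.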

Quantization
Although there are many different quantization methods [21], we only consider the linear quantization method proposed by Jacob et al. [22] in this work. This is because the linear quantization method only involves linear operations, which benefits constructing an SS-based MPC protocol.
For 8-bit quantization, a 32-bit floating-point value $\alpha \in \mathbb{R}$ is quantized as an 8-bit integer $a \in [0, 2^8)_{\mathbb{Z}}$. The relationship between $\alpha$ and $a$ is given by the dequantization function $D_{S,Z}$:
$$\alpha = D_{S,Z}(a) = S \cdot (a - Z), \tag{2}$$
where $S \in \mathbb{R}^+$ is called the scale, and $Z \in [0, 2^8)_{\mathbb{Z}}$ is called the zero-point. As pointed out by Jacob et al. [22], both $S$ and $Z$ are determined at the training phase of the neural network; thus, $(S, Z)$ is a constant parameter in the inference phase. We use a single set of quantization parameters for each activation array and each weights array in the same neural network layer.
In order to convert FP32 to INT8, we define the quantization function $Q_{S,Z}$ to be the inverse of $D_{S,Z}$, so that
$$a = Q_{S,Z}(\alpha) = \lfloor \alpha / S \rceil + Z, \tag{3}$$
where $\lfloor \cdot \rceil$ is a rounding operation. Note that multiple numbers may map to the same integer due to the rounding operation; see Figure 1 (cf. [15]).
As an important part of QNN, when we compute the convolution of two quantized tensors, we have to compute the clamping function Clamp(x; a, b) = min(max(x, a), b) to bound the quantized result to $[0, 2^8)_{\mathbb{Z}}$, i.e., Clamp(x; 0, $2^8 - 1$) should be computed. We refer the reader to [15,22] for more details.
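The quantization, dequantization, and clamping functions can be sketched in a few lines of Python. The rounding mode (round half up) and the clamp applied inside `quantize` are assumptions of this example; the text only states that a rounding operation is used and that results are bounded to 8 bits.

```python
import math

def clamp(x, lo, hi):
    """Clamp(x; lo, hi) = min(max(x, lo), hi)."""
    return min(max(x, lo), hi)

def quantize(alpha, S, Z):
    """Map a real value to an 8-bit integer: round(alpha / S) + Z, then clamp."""
    a = math.floor(alpha / S + 0.5) + Z      # round half up (assumed rounding mode)
    return clamp(a, 0, 2**8 - 1)

def dequantize(a, S, Z):
    """Map an 8-bit integer back to the real value S * (a - Z)."""
    return S * (a - Z)
```

For instance, with the hypothetical parameters S = 0.5 and Z = 10, both 1.0 and 1.2 quantize to the same integer, which illustrates the many-to-one behaviour caused by rounding.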

The Architecture for Secure Inference
Our secure inference system is built on an outsourced computation architecture and is given in Figure 2. The system has three different roles, which we describe as follows:
• Server: There are three non-colluding servers in our system, denoted as $P_0$, $P_1$, $P_2$. The three servers can be run by different companies in the real world, such as Amazon, Alibaba, and Google; any collusion would damage their reputations. Similar to prior works, we assume that all servers know the layer types, the size of each layer, and the number of layers. All servers perform the secure protocols proposed in Section 4 to execute inference tasks on users' shared queries in a secure way.
• User: The user holds some queries as input and wants to enjoy a secure inference service without revealing either the queries or the inference results to others. To do so, the user first uses Equation (3) to convert the query to 8-bit integers, then uses $\llbracket\cdot\rrbracket$-sharing to split the quantized queries into masked shares before uploading them to the three servers, and finally receives the shares of the inference results from the three servers. Note that only the user can reconstruct the final results; the privacy of both queries and inference results is protected during the secure inference.
• Model Owner: The model owner holds a trained QNN model, which includes all quantized weights of the different layers along with the quantization parameters. As important intellectual property belonging to the model owner, the privacy of the QNN model should be protected. To do so, the model owner uses $\llbracket\cdot\rrbracket$-sharing to split the quantized weights into masked shares before deploying them to the three servers. Once the deployment is done, the model owner can go offline until it wants to update the model.
Similar to prior works of secure inference [8,20], we do not consider black-box attacks toward neural networks, such as model extraction attacks, model inversion attacks, and membership inference attacks, since these attacks are independent of the cryptographic techniques used to make the inference process secure [23].
As pointed out by Dalskov et al. [15], we might not enjoy the benefits of the size reduction when considering secure inference. Although data and network weights can be stored as 8-bit integers, the arithmetic operations must be computed modulo $2^\ell$. This work only focuses on reducing communication rounds and computation costs among the three servers.
We use the offline-online paradigm to construct our secure protocols. This paradigm makes it possible to split the protocol into the offline phase and online phase, where the offline phase is independent of the input of the parties and the online phase depends on the specific input. We argue that the user occasionally raises inference requests; the servers will have enough time to process the offline phase to speed up the execution of upcoming inference requests [23].

Protocols Construction
According to Section 3, the model owner provides the weights of each layer and the quantization parameters to the three servers, which allows us not to consider the impact of quantization. To construct an efficient secure inference scheme in the WAN setting, we need a series of building blocks with constant-round communication for the three servers, which is the goal of this section. Our main contribution here is to present secure truncation, conversion, and clamping protocols for secure inference among three servers. The other protocols follow the previous work [19], but we still give the details for completeness.

Secure Input Sharing Protocol
Let $P_i$ be the secret owner holding $x$. We define functionality $F_{Share}$, which allows the parties to generate $\llbracket x \rrbracket$. To achieve $F_{Share}$, we follow [19] and show it in Protocol 1, which requires the communication of at most $2\ell$ bits and 1 round in the online phase.
Offline:
• If $P_i = P_0$: $P_0$ and $P_k$ for $k \in \{1, 2\}$ together sample $[\lambda_x]_k \in_R \mathbb{Z}_{2^\ell}$ by using $F_{Rand}$.
• If $P_i = P_k$ for $k \in \{1, 2\}$: $P_0$ and $P_k$ together sample $[\lambda_x]_k \in_R \mathbb{Z}_{2^\ell}$, while $P_0$ and $P_{3-k}$ together sample $[\lambda_x]_{3-k} \in_R \mathbb{Z}_{2^\ell}$, by using $F_{Rand}$.
Observe that if both $P_i$ and $P_j$ hold the secret $x$, then $\llbracket x \rrbracket := (m_x, [\lambda_x]_1, [\lambda_x]_2)$ can be generated without any communication by setting some shares to 0 instead of using $F_{Rand}$, which is inspired by [24]. For simplicity, we still use the same notation to denote this case, i.e., $\llbracket x \rrbracket \leftarrow F_{Share}(P_i, P_j, x)$. To achieve $F_{Share}$, $\Pi_{Share}$ can be done as follows:

Secure Truncation Protocol
Recall that when the parties execute the secure multiplication protocol on fixed-point values, we have to deal with a double-precision result. More precisely, when all shared values are represented as $\ell$-bit fixed-point values with $d$-bit precision, the product of two fixed-point numbers has $2d$-bit precision and must be truncated by $d$ bits to keep the correct fixed-point representation. ABY3 [7] proposed faithful truncation, which only works on replicated secret sharing (RSS). Although [19] has the same sharing semantics as ours, it does not provide a secure truncation protocol. In this work, we extend faithful truncation to MSS as one of our contributions.
We define the secure truncation functionality $\llbracket x^d \rrbracket \leftarrow F_{Trunc}(\llbracket x \rrbracket, d)$, where $x$ has $2d$-bit precision and $x^d = x/2^d$. Suppose that the parties hold $\llbracket x \rrbracket$ and a random shared truncated pair $(\llbracket r \rrbracket, \llbracket r^d \rrbracket)$, where $r$ is a random value and $r^d$ denotes $r$ truncated by $d$ bits, i.e., $r^d = r/2^d$. In the online phase, the parties mask and reveal $x - r$, truncate it in the clear, and use $r^d$ to unmask, obtaining the truncated result $x^d = (x - r)/2^d + r^d$.
The challenge here is to generate the random shared truncated pair $(\llbracket r \rrbracket, \llbracket r^d \rrbracket)$ among the parties. To do so, we utilize the fact that if $r_d$ denotes the last $d$ bits of $r$, then $r = 2^d \cdot r^d + r_d$. Instead of having $P_0$ sample $r$ directly, $P_0$ and $P_j$ for $j \in \{1, 2\}$ together sample random $r_j$ by using $F_{Rand}$, such that $r = r_1 + r_2$ can be locally computed by $P_0$. In this way, $P_0$ can compute $r^d$ directly and then share it with $P_1$ and $P_2$ by invoking $F_{Share}$. During the online phase, $P_1$ and $P_2$ reconstruct $y = x - r$ and truncate it to obtain $y^d$, and then $r^d$ is used to unmask the result. The protocol is described in Protocol 2, which requires the communication of at most $2\ell$ bits and 1 round in the online phase.

Offline:
1. $P_0$ and $P_j$ for $j \in \{1, 2\}$ together sample random $r_j \in_R \mathbb{Z}_{2^\ell}$ by using $F_{Rand}$.
2. $P_0$ locally computes $r = r_1 + r_2$, and then truncates $d$ bits to obtain $r^d$.
Online:
2. $P_j$ for $j \in \{1, 2\}$ locally computes and sends $y_j = x_j - r_j$ to $P_{3-j}$.
3. $P_1$ and $P_2$ locally reconstruct $y = x - r$ and then truncate $d$ bits to obtain $y^d$.
4. The parties locally generate $\llbracket y^d \rrbracket \leftarrow F_{Share}(P_1, P_2, y^d)$.
5. The parties locally compute $\llbracket x^d \rrbracket = \llbracket y^d \rrbracket + \llbracket r^d \rrbracket$.
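The mask, reveal, truncate, and unmask idea can be checked on plaintext values with the sketch below. It does not operate on shares; it only verifies that $(x - r)/2^d + r^d$ matches $x/2^d$ up to an error of one unit in the last place. Interpreting ring elements as signed integers and using an arithmetic shift are assumptions of this example, as are the helper names.

```python
import random

ELL = 64             # share bit-length
MOD = 1 << ELL
D = 13               # number of fractional bits to truncate

def to_signed(v):
    """Interpret a ring element as a signed integer in [-2^(ELL-1), 2^(ELL-1))."""
    v %= MOD
    return v - MOD if v >= MOD // 2 else v

def trunc_pair(rng):
    """Return (r, r_d): r = r1 + r2 and its truncation by D bits."""
    r1, r2 = rng.randrange(MOD), rng.randrange(MOD)
    r = (r1 + r2) % MOD
    r_d = (to_signed(r) >> D) % MOD      # arithmetic shift on the signed value
    return r, r_d

def truncate(x, r, r_d):
    """Reveal y = x - r, truncate it in the clear, then unmask with r_d."""
    y = (x - r) % MOD
    y_d = (to_signed(y) >> D) % MOD
    return (y_d + r_d) % MOD
```

Because the low $d$ bits of $y$ and $r$ are dropped separately, a carry between them can be lost, so the result equals the exact truncation or is smaller by one. A large error is possible only if $x - r$ wraps around the signed range, which for inputs much smaller than $2^{\ell}$ happens with negligible probability.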

Secure Conversion Protocol
We define $F_{Bit2A}$ to convert the Boolean shares of a single bit $\llbracket b \rrbracket^B$ to its arithmetic shares $\llbracket b \rrbracket$. To do so, we utilize the fact that if $a$ and $b$ are two bits, then $a \oplus b = a + b - 2ab$.
Let $b^A$ be the value of bit $b$ over $\mathbb{Z}_{2^\ell}$. Then, according to this fact and the masked sharing semantics, we have
$$b^A = (m_b \oplus \lambda_b)^A = m_b^A + \lambda_b^A - 2\,m_b^A \lambda_b^A, \qquad \lambda_b^A = \lambda^A_{b,1} + \lambda^A_{b,2} - 2\,\lambda^A_{b,1} \lambda^A_{b,2}.$$
In other words, $\Pi_{Bit2A}$ can be computed by invoking the secure input sharing protocol and the secure multiplication protocol of masked secret sharing. Note that $P_0$ holds both $\lambda^A_{b,1}$ and $\lambda^A_{b,2}$, and thus $u = \lambda^A_{b,1} \cdot \lambda^A_{b,2}$ can be locally computed by $P_0$ without using Beaver triples.
To achieve $\Pi_{Bit2A}$, we describe the construction in Protocol 3, which requires the communication of at most $2\ell$ bits and 1 round in the online phase.
Offline:
1. $P_j$ for $j \in \{1, 2\}$ and $P_0$ together sample $\lambda^A_{b,j} \in_R \mathbb{Z}_{2^\ell}$ using $F_{Rand}$.
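The XOR-to-arithmetic identity behind the conversion can be verified exhaustively on plaintext bits. The sketch below is only a correctness check of the algebra, not the protocol itself; the function names are chosen for this example.

```python
ELL = 64
MOD = 1 << ELL

def xor_arith(a, b):
    """Arithmetic value of a XOR b for bits a, b, computed in Z_{2^ELL}."""
    return (a + b - 2 * a * b) % MOD

def bit2a_value(m_b, lam1, lam2):
    """Arithmetic value of b = m_b XOR lam1 XOR lam2 using only +, -, *."""
    u = (lam1 * lam2) % MOD              # computable locally by a party knowing both masks
    lam = (lam1 + lam2 - 2 * u) % MOD    # arithmetic value of lam1 XOR lam2
    return (m_b + lam - 2 * m_b * lam) % MOD
```

The only product that involves values held by different parties is $m_b \cdot \lambda_b$, which is why one secure multiplication suffices once the product of the two mask bits has been computed locally.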

Secure Comparison Protocol
Comparison is an important building block of the neural network for evaluating the ReLU activation, the argmax function, and the pooling layer. Fortunately, we can easily compare quantized values if the quantized $a$ and $b$ have the same quantization parameters $(S, Z)$. This is because if $\alpha = S(a - Z)$ and $\beta = S(b - Z)$, then $\alpha \le \beta$ holds if and only if $a \le b$ holds. Therefore, the key step is to compute the most significant bit (MSB) of $(a - b)$, i.e., $a \le b$ if and only if MSB$(a - b) = 1$. Letting $x = a - b$, we define the secure comparison functionality $F_{MSB}$, which takes the shared value $\llbracket x \rrbracket$ and extracts the Boolean shared bit $\llbracket c \rrbracket^B$ such that $c = (a \overset{?}{\le} b) = \mathrm{MSB}(x)$.
The secure comparison protocol of ABY3 [7] needs $\log \ell$ rounds in the online phase. To construct a constant-round comparison protocol, we implement $\Pi_{MSB}$ with the three-party GC proposed by [19].
Let GC$(u_1, u_2, u_3)$ be a GC with inputs $u_1, u_2 \in \mathbb{Z}_{2^\ell}$, $u_3 \in \{0, 1\}$, and output a masked bit $y = \mathrm{MSB}(u_1 - u_2) \oplus u_3$. We treat $P_0$ and $P_1$ as the common garblers and $P_2$ as the evaluator. The circuits are generated by $P_0$ and $P_1$ with correlated randomness by using $F_{Rand}$. Namely, both garblers hold the knowledge of the GC, including the keys and the decoding table in the clear. In our situation, the parties hold $\llbracket x \rrbracket := (m_x, [\lambda_x]_1, [\lambda_x]_2)$; thus, we can define $u_1 = m_x - [\lambda_x]_1$ as the input of $P_1$, $u_2 = [\lambda_x]_2$ as the input of $P_2$, and $u_3$ as a random bit sampled by $P_0$ and $P_1$ using $F_{Rand}$.
Note that $P_0$ also knows $u_2$ and the corresponding key; hence, $P_0$ sends the key of $u_2$ to $P_2$ directly without using OT. $P_2$ evaluates the circuit to obtain $y$, then shares it with $\Pi_{Share}$, which only requires communication of at most 2 bits. Finally, the parties remove the masked bit $\llbracket u_3 \rrbracket^B$ to obtain the masked share $\llbracket c \rrbracket^B = \llbracket \mathrm{MSB}(x) \rrbracket^B$. As pointed out by [25], the underlying circuit can be instantiated using the depth-optimized parallel prefix adder (PPA) of ABY3 [7]. The GC can be further optimized by state-of-the-art techniques, such as free-XOR [26] and half gates [27]. We describe the details in the following Protocol 4, which requires the communication of at most $\kappa\ell + 2$ bits and 2 rounds in the online phase.

Offline:
3. $P_0$ and $P_1$ together garble GC and generate its decoding table by using $F_{Rand}$.
4. $P_0$ sends the key of $u_2$, and $P_1$ sends both the GC and the table, to $P_2$.
Online:
1. $P_1$ sends the keys of $u_1$ to $P_2$.
2. $P_2$ evaluates the circuit to obtain $y$.
4. The parties locally compute $\llbracket c \rrbracket^B = \llbracket y \rrbracket^B \oplus \llbracket u_3 \rrbracket^B$.
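The function evaluated inside the garbled circuit, and the effect of the mask bit $u_3$, can be illustrated in the clear. The sketch below involves no garbling or key transfer at all; it only shows that the evaluator's output $y$ is the sign bit of $x$ masked by $u_3$, and that removing $u_3$ yields MSB(x). All names are assumptions for this example.

```python
import secrets

ELL = 64
MOD = 1 << ELL

def msb(v):
    """Most significant bit of v viewed as an element of Z_{2^ELL}."""
    return (v % MOD) >> (ELL - 1)

def gc_functionality(u1, u2, u3):
    """Plaintext function computed by the circuit: MSB(u1 - u2) XOR u3."""
    return msb(u1 - u2) ^ u3

def compare_msb(m_x, lam1, lam2):
    u1 = (m_x - lam1) % MOD              # input derived from P1's shares
    u2 = lam2                            # second input, a share of the mask
    u3 = secrets.randbits(1)             # mask bit known only to the garblers
    y = gc_functionality(u1, u2, u3)     # the masked bit the evaluator learns
    return y ^ u3                        # unmasking, done on Boolean shares in the protocol
```

Since $u_1 - u_2 = m_x - \lambda_x = x$, the circuit output is MSB(x) hidden by a uniformly random bit, so the evaluator learns nothing about the sign of $x$ from $y$ alone.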

Secure Clamping Protocol
As pointed out in Section 2.5, when we compute the convolution of two quantized tensors, rounding errors may produce a result $c \notin [0, 2^8)_{\mathbb{Z}}$, and hence a clamping operation $c \leftarrow \mathrm{Clamp}(c; 0, 2^8 - 1)$ should be computed [15].
Let $u = (x \overset{?}{\le} a)$. Then, according to $y = \max(x, a)$, one has
$$y = \begin{cases} a, & u = 1 \\ x, & u = 0 \end{cases} \tag{4}$$
which is equivalent to the following Equation (5):
$$y = u \cdot a + (1 - u) \cdot x = x + u \cdot (a - x). \tag{5}$$
Similarly, let $v = (y \overset{?}{\le} b)$. Then, according to $z = \min(y, b)$, one has the following Equation (6):
$$z = v \cdot y + (1 - v) \cdot b = b + v \cdot (y - b). \tag{6}$$
From Equations (5) and (6), one has
$$z = b + v \cdot (x - b) + u \cdot v \cdot (a - x), \tag{7}$$
and thus the key point of the secure clamping protocol is how to securely implement Equation (7). Let $e = x - b$ and $f = a - x$. Note that when we implement Equation (7) with masked secret sharing, the shares of $u$ and $v$ are Boolean shares over $\mathbb{Z}_2$, while the shares of $e$ and $f$ are arithmetic shares over $\mathbb{Z}_{2^\ell}$. In other words, we cannot invoke the secure multiplication protocol directly. Instead, we first convert the Boolean shares to arithmetic shares using the secure conversion protocol and then invoke the secure multiplication protocol.
For simplicity, we formalize the above steps as the bit injection functionality $\llbracket c \rrbracket \leftarrow F_{BitInj}(\llbracket b \rrbracket^B, \llbracket x \rrbracket)$: given the Boolean shares of a bit $b$ and the arithmetic shares of $x$, the secure bit injection functionality allows the parties to compute $\llbracket c \rrbracket = \llbracket bx \rrbracket$. We provide $\Pi_{BitInj}$ in Protocol 5, which requires the communication of at most $4\ell$ bits and 2 rounds. Now, we can give our secure clamping protocol in the following Protocol 6. Steps 5-6 can be computed in parallel within 2 rounds. Therefore, Protocol 6 requires the communication of at most $2\kappa\ell + 12\ell + 4$ bits and 8 rounds in the online phase.
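The parties execute $\llbracket c \rrbracket \leftarrow F_{Mul}(\llbracket b \rrbracket, \llbracket x \rrbracket)$.

1. The parties locally compute $\llbracket e \rrbracket = \llbracket x \rrbracket - b$ and $\llbracket f \rrbracket = a - \llbracket x \rrbracket$.
2. The parties execute $\llbracket u \rrbracket^B \leftarrow F_{MSB}(\llbracket x \rrbracket - a)$.
4. The parties execute $\llbracket v \rrbracket^B \leftarrow F_{MSB}(\llbracket y \rrbracket - b)$.
The parties locally compute $\llbracket z \rrbracket = b + \llbracket g \rrbracket + \llbracket h \rrbracket$.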
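The arithmetic behind clamping with two comparison bits can be checked in the clear. In the sketch below, the comparisons stand in for the two secure comparison calls and the products of a bit with an integer stand in for bit injection. The assignment $g = v \cdot e$ and $h = v \cdot (u \cdot f)$ is an assumption that is consistent with $z = b + g + h$, since the intermediate protocol steps are not spelled out here.

```python
def clamp_via_bits(x, a, b):
    """Compute min(max(x, a), b) using only comparison bits, + and *."""
    e, f = x - b, a - x
    u = int(x <= a)          # first comparison bit
    y = x + u * f            # y = max(x, a)
    v = int(y <= b)          # second comparison bit
    g = v * e                # v * (x - b)
    h = v * (u * f)          # v * u * (a - x)
    return b + g + h         # equals b + v * (y - b) = min(y, b)
```

When $v = 0$ the result is $b$, and when $v = 1$ it is $b + (x - b) + u(a - x) = y$, so the expression selects between the two branches without any data-dependent control flow.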

Theoretical Complexity
The total communication and round complexity of our protocols are provided in Table 3. It is easy to see that all our protocols have constant-round communication in the online phase.
Table 3. The communication and round complexity of our protocols in the offline and online phases, where $\ell$ denotes the logarithm of the ring size, and $\kappa$ denotes the security parameter. All communication is reported in bits.

Security Analyses
This section gives proof sketches of our protocols in the real-ideal paradigm. We present the steps of the simulator Sim for A in the stand-alone model with security under sequential composition [28]. The proofs work in the $F_{Rand}$-hybrid model.
Theorem 1. $\Pi_{Share}$ securely realizes the functionality $F_{Share}$ in the $F_{Rand}$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. Given the ideal $F_{Rand}$, the output of the PRFs is a pseudo-random value, which Sim can simulate by uniformly sampling a random value. Note that $P_i$ sends $m_x$ to $P_j$, and $m_x$ is masked by the random $[\lambda_x]_i$, which is unknown to $P_j$; hence, a corrupted $P_j$ cannot learn any information about $x$. In short, the view of A in the real execution is computationally indistinguishable from the view of Sim in the ideal execution.
Theorem 2. $\Pi_{Trunc}$ securely realizes the functionality $F_{Trunc}$ in the $(F_{Rand}, F_{Share}, F_{Mul})$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. Given the ideal $F_{Rand}$, $F_{Share}$, and $F_{Mul}$, the correlated randomness can be simulated by invoking $F_{Rand}$. Then, Sim invokes $F_{Share}$ to simulate step 4 in the offline phase. Finally, Sim invokes $F_{Mul}$ to simulate step 2 in the online phase. Note that the outputs of all functionalities are random shares over $\mathbb{Z}_{2^\ell}$, and hence the view of A in the real execution is computationally indistinguishable from the view of Sim in the ideal execution.
Theorem 3. $\Pi_{Bit2A}$ securely realizes the functionality $F_{Bit2A}$ in the $(F_{Rand}, F_{Share}, F_{Mul})$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. The security of $\Pi_{Bit2A}$ can be reduced to the security of $\Pi_{Share}$ and $\Pi_{Mul}$, which were proven to be secure in Theorem 1 and [19], respectively. Since we make only black-box access to $F_{Share}$ and $F_{Mul}$, according to sequential composition, the Bit2A protocol is secure in the semi-honest model.
Theorem 4. $\Pi_{MSB}$ securely realizes the functionality $F_{MSB}$ in the $(F_{Rand}, F_{Share})$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. Given the ideal functionalities $F_{Rand}$ and $F_{Share}$, the security of $\Pi_{MSB}$ is trivial for $P_0$ and $P_1$. This is because neither $P_0$ nor $P_1$ knows both $u_1$ and $u_2$ at the same time. Because the parties are non-colluding, $y$ is oblivious to the corrupted party, even though both garblers have the circuit in the clear. Observe that $P_2$ evaluates the circuit to obtain $y$, which is masked by the common random bit $u_3$ of $P_0$ and $P_1$. In other words, $y$ is uniformly random to $P_2$. Therefore, the view of A in the real execution is computationally indistinguishable from the view of Sim in the ideal execution.
Theorem 5. $\Pi_{BitInj}$ securely realizes the functionality $F_{BitInj}$ in the $(F_{Bit2A}, F_{Mul})$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. The security of $\Pi_{BitInj}$ can be reduced to the security of $\Pi_{Bit2A}$ and $\Pi_{Mul}$, which were proven to be secure in Theorem 3 and [19], respectively. Since we make only black-box access to $F_{Bit2A}$ and $F_{Mul}$, according to sequential composition, the proposed bit injection protocol is secure in the semi-honest model.
Theorem 6. $\Pi_{Clamp}$ securely realizes the functionality $F_{Clamp}$ in the $(F_{MSB}, F_{BitInj})$-hybrid model against a semi-honest adversary A who corrupts only a single party.
Proof. The security of $\Pi_{Clamp}$ can be reduced to the security of $\Pi_{MSB}$ and $\Pi_{BitInj}$, which were proven to be secure in Theorems 4 and 5, respectively. Since we make only black-box access to $F_{MSB}$ and $F_{BitInj}$, according to sequential composition, our secure clamping protocol is secure in the semi-honest model.

Quantized Neural Network Structure
We consider the convolutional neural network presented in Chameleon [2], which includes a single convolution layer and two fully connected layers. The activation function is ReLU activation. We consider its quantized variant as our QNN structure and describe it in Figure 3. As we pointed out above, we set all data types of QNN from FP32 to INT8.
Instead of evaluating the original ReLU activation, we evaluate the ReLU6 activation, as fixed ranges are easier to quantize with high precision across different channels, and a quantized model with ReLU6 has less accuracy degradation [22]. Herein, the ReLU6 activation is defined as ReLU6(x) = min(max(x, 0), 6) = Clamp(x; 0, 6), which is essentially a clamping operation. At first sight, it seems that we have to invoke a secure comparison protocol to evaluate the ReLU6 activation.
In fact, as pointed out by [22], we can take advantage of quantization such that ReLU6 is entirely fused into the computation of the inner product that precedes it. To do so, we directly set the quantization parameters to $S = 6/255$ and $Z = 0$; then $\alpha = S(a - Z) \in [0, 6]$ always holds for any $a \in [0, 2^8)_{\mathbb{Z}}$. By doing this, clamping the inner product to $a \in [0, 2^8)_{\mathbb{Z}}$ evaluates the ReLU6 activation at the same time. Namely, we can evaluate the ReLU6 activation without any additional communication overhead.
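The fusion argument can be checked numerically on plaintext values: with S = 6/255 and Z = 0, clamping the quantized value to [0, 255] gives the same integer as applying ReLU6 in floating point and quantizing afterwards. The rounding mode (round half up) and the function names are assumptions of this sketch.

```python
import math

S, Z = 6 / 255, 0            # quantization parameters from the text

def relu6(alpha):
    return min(max(alpha, 0.0), 6.0)

def quantize_unclamped(alpha):
    """round(alpha / S) + Z with round-half-up (assumed rounding mode)."""
    return math.floor(alpha / S + 0.5) + Z

def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def fused(alpha):
    """Clamp the quantized value only; no separate ReLU6 evaluation."""
    return clamp(quantize_unclamped(alpha), 0, 255)

def reference(alpha):
    """Apply ReLU6 in floating point first, then quantize and clamp."""
    return clamp(quantize_unclamped(relu6(alpha)), 0, 255)
```

The two functions agree because rounding is monotone and the clamp bounds 0 and 255 are exactly the quantized images of 0 and 6, so the activation costs nothing beyond the clamp that quantized inference already requires.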
In addition, the evaluation of the argmax function can be computed by invoking the secure comparison protocol.

Experimental Setup
We implemented our protocols in Python. All our experiments were executed on a server running Ubuntu 20.04 LTS, equipped with an Intel(R) Xeon(R) Gold 5222 CPU (@3.80GHz) with AES-NI support and 32GB of RAM. The three parties were simulated by three different terminal ports. We used the Linux traffic control command tc to simulate LAN and WAN. Specifically, we considered the LAN setting with 625 Mbps bandwidth and 0.2 ms ping time, and the WAN setting with 80 Mbps bandwidth and 20 ms ping time. Note that these parameters are close to those of everyday networks, which suggests that our solution is practical.
All experiments were executed 10 times on our server to eliminate accidental errors, and we report the average results. We set the bit-length of the shares $\ell = 64$, the fixed-point precision $d = 13$, and the security parameter $\kappa = 128$.
To simplify the experiment, we also made the following assumptions:
• We suppose that the input of the user is taken from the MNIST dataset [29], which contains 60,000 training images and 10,000 testing images of handwritten digits. Each image is represented as 28 × 28 pixels with greyscale values between 0 and 255. Note that all greyscale values are already stored as 8-bit integers, which eliminates the need for data type conversions.
• We assume that the model owner has shared the quantized parameters of each layer among all servers. In short, the quantized parameters are encoded into all layers.

Experimental Results for Secure Inference
In our experiment, we compare our solution to the two-party framework Chameleon [2] and various three-party frameworks, including BANNERS [12], SecureBiNN [13], and SecureQ8 [15]. Note that neither Chameleon nor BANNERS is publicly available; hence, we use their reported results directly for reference. Both BANNERS and SecureBiNN are designed for binary neural network inference. We also compare our solution to SecureQ8, which is also based on INT8 quantization and was implemented with MP-SPDZ [30] in the same setting. The experimental results for both LAN and WAN are reported in Table 4. All communication is reported in MB, and runtimes are in seconds.
The authors of Chameleon [2] claim that the original network achieves an accuracy of 99%. However, our experiment shows that the accuracy is less than 80% when we convert it into the quantized variant shown in Figure 3. Therefore, instead of reporting the Top-1 accuracy of the model, we report its Top-5 accuracy, where the true label is among the first five outputs of the model. In this way, our proposed solution achieves a Top-5 accuracy of 98.4%. Note that the reported accuracy of different frameworks is only for reference since it may depend on the model parameters. This is beyond the scope of our work.
As shown in Table 4, almost all quantized frameworks are faster than the non-quantized scheme Chameleon in the same setting. The communication cost of the quantized frameworks is also less than that of the non-quantized scheme. In addition, INT8 quantized schemes are better than binarized schemes in terms of Top-5 accuracy, but the latter have lower communication costs and runtimes.
Compared to Chameleon, due to the quantization technique, our protocols were 1.41 times and 1.94 times faster in the LAN and WAN settings, respectively. In addition, our protocols were 1.32 times lower in online communication. Compared to SecureQ8, our scheme was 1.11 times slower in the LAN setting, but 1.5 times faster in the WAN setting. Because our protocols have constant-round complexity, they are suitable for low-bandwidth and high-latency networks. Note that the online communication costs of our scheme were slightly larger than those of SecureQ8, as our comparison protocol is based on three-party GC, where the decoding key is related to the security parameter $\kappa$.
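The reason why round count dominates in the WAN setting can be illustrated with a rough back-of-the-envelope model. The sketch below is not a measurement: it assumes that each round costs one ping time plus the transmission time of the bits sent, ignores computation entirely, and grants the hypothetical log-round comparison zero communication as a best case. The network parameters and the values of the ring exponent and security parameter are those reported in the experimental setup.

```python
import math

def online_time(rounds, bits, latency_s, bandwidth_bps):
    """Rough estimate: one latency per round plus transmission time."""
    return rounds * latency_s + bits / bandwidth_bps

ELL, KAPPA = 64, 128
WAN = dict(latency_s=0.020, bandwidth_bps=80e6)      # 20 ms ping, 80 Mbps

# Constant-round comparison: 2 rounds, KAPPA * ELL + 2 bits online.
gc_time = online_time(2, KAPPA * ELL + 2, **WAN)

# Hypothetical log-round comparison: log2(ELL) rounds, communication ignored.
log_rounds = int(math.log2(ELL))
log_time = online_time(log_rounds, 0, **WAN)
```

Under these assumptions the constant-round comparison spends almost all of its time on the two round trips, and the extra bits caused by the security parameter add only a fraction of a millisecond at 80 Mbps.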
Our protocols also enjoy the benefit of the offline-online paradigm. Specifically, most of the communication cost of the online phase is transferred to the offline phase, which makes our scheme more efficient than SecureQ8 in the online phase, especially in the WAN setting. To see this more clearly, we also plot a performance comparison of batch inferences in Figure 4.

Conclusions
We proposed a series of three-party protocols based on MSS in this study. Our key contribution is a set of more communication-efficient building blocks for QNN inference. Our experiments show that our protocols are suitable for low-bandwidth and high-latency environments, especially in the WAN setting. All these building blocks can also be used in other applications as long as the underlying sharing semantics are the same as ours.
Our constant-round comparison protocol is built on GC, and although it is free of OT, its online communication is related to the security parameter $\kappa$. How to construct a constant-round secure comparison protocol whose online communication cost is independent of the security parameter is still an open problem.
Moreover, we only consider a semi-honest adversary with Q3 structures (i.e., the adversary corrupts no more than 1/3 of the parties). Achieving security against other adversary structures with malicious adversaries is left for future work.